Machine Learning 2

Exercise 1

Logistic Regression and Gradient Descent

1 - Lecture notes exercise 1

Solve exercise 1 in the lecture notes.


In order to apply logistic regression, we need to know how to optimize functions - in our case, the logistic regression loss (3.11) from the lecture notes. If you already have experience in optimization, you may not need the following two assignments.
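
For orientation, and assuming that (3.11) is the usual negative log-likelihood of the sigmoid model $\sigma(z) = 1/(1+e^{-z})$ (the notation in the lecture notes may differ), the loss and its gradient take the form

$$L(\theta) = -\sum_{i=1}^{N} \Big[\, y_i \log \sigma(\langle\theta, \hat x_i\rangle) + (1-y_i)\log\big(1-\sigma(\langle\theta, \hat x_i\rangle)\big) \Big], \qquad \nabla_\theta L(\theta) = \sum_{i=1}^{N} \big(\sigma(\langle\theta, \hat x_i\rangle) - y_i\big)\, \hat x_i,$$

which is exactly what the gradient-descent machinery from tasks 2 and 3 is later applied to.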

  • Thanks to David for tasks 2 and 3.

2 - Calculate some gradients

a) Calculate the gradients of the following functions

$$f(x, y) = \frac{1}{x^2+y^2}$$

and $$f(x, y) = x^2y.$$

b) A standard way to find a minimum computationally is gradient descent.
Start at some (possibly random) point $ \overrightarrow{p}=(x,y)^T $ and move downhill, i.e. in the direction of the negative gradient. The step size $\lambda$ must be controlled, or at least chosen small enough. When a loss function is optimized in a machine-learning context, $\lambda$ is also called the learning rate.

The update equation

$$ \overrightarrow{p_{i+1}}= \overrightarrow{p_{i}} - \lambda \cdot \nabla f(x_i, y_i)$$

is then iterated until the norm of the gradient is below some threshold.

Write down the update equations for the two functions in a)!
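
To make the iteration concrete, here is a minimal gradient-descent sketch in Python showing the generic structure of the loop from b). The quadratic test function $f(x, y) = x^2 + y^2$ is my own choice for illustration (neither function from a) attains a finite minimum); only the update rule itself is the point.

import numpy as np

def gradient_descent(grad, p0, lam=0.1, tol=1e-6, max_iter=1000):
    # Iterate p <- p - lam * grad(p) until the norm of the gradient
    # drops below tol (or max_iter is reached).
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad(p)
        if np.linalg.norm(g) < tol:
            break
        p = p - lam * g
    return p

# Illustrative test function (not from a): f(x, y) = x**2 + y**2
# with gradient (2x, 2y); the minimum is at the origin.
grad_f = lambda p: np.array([2*p[0], 2*p[1]])
print(gradient_descent(grad_f, p0=(1.5, -2.0)))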

3 - Visualization of gradient descent

For this task we use the double well potential

$$V(x) = ax^4 + bx^2 + cx + d$$

with $a = 1$, $b = -3$, $c =1$ and $d = 3.514$.

We seek to find the global minimum $x_{min}$ of this function with gradient descent. (In 1D the gradient is just the derivative.)

a) Calculate the derivative of $V(x)$ and the update equation for $x$ with learning rate $\lambda$.
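
(As a check for a): differentiating term by term gives $V'(x) = 4ax^3 + 2bx + c$, so the update reads $x_{i+1} = x_i - \lambda\,(4ax_i^3 + 2bx_i + c)$.)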

b) Complete the code below.

c) Test the following combinations of starting point $x_0$ and learning rate $\lambda$:

$$(x_0, \lambda) = (-1.75, 0.001)$$
$$(x_0, \lambda) = (-1.75, 0.19)$$
$$(x_0, \lambda) = (-1.75, 0.1)$$
$$(x_0, \lambda) = (-1.75, 0.205)$$

d) How could one find a compromise between $(x_0, \lambda) = (-1.75, 0.001)$ and $(x_0, \lambda) = (-1.75, 0.19)$?

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def update2(x, a, b, c, d, lam):
  # One gradient-descent step on V(x); fill in the update rule derived in a).
  x = ___

  return x

def V(x, a, b, c, d):
  return a*x**4 + b*x**2 + c*x + d

a = 1
b = -3
c = 1
d = 3.514

x0 = -1.75
iterations = 101
lams = np.array([0.001, 0.19, 0.1, 0.205])

losses = np.empty(shape=(iterations, len(lams)))
results = np.empty(len(lams))

# Run gradient descent for each learning rate and record the loss per iteration.
for j in range(len(lams)):
  x = x0
  lam = lams[j]
  for i in range(iterations):
    losses[i, j] = V(x, a, b, c, d)
    if i != iterations - 1:
      x = update2(x, a, b, c, d, lam)
  results[j] = x

for j in range(len(lams)):
  print(100*"-")
  print("Lambda: ", lams[j])
  print("xmin: ", results[j])
  print("Loss: ", V(results[j], a, b, c, d))

colors = {
    0.001: "blue",
    0.19: "red",
    0.1: "black",
    0.205: "orange"
}

plt.figure(figsize=(8, 8))
plt.title("Learning curves")
plt.xlabel("Epoch")
plt.ylabel("Loss V")
plt.xlim(0, iterations)

for i in range(len(lams)):
  lam = lams[i]
  plt.plot(range(iterations), losses[:, i], label=str(lam), color=colors[lam])

plt.legend()
plt.ylim(bottom=0)
plt.show()

plt.figure(figsize=(8, 8))
plt.title("Function V and Minima")
plt.xlabel("x")
plt.ylabel("V(x)")

xs = np.linspace(-2, 2, 100)
ys = V(xs, a, b, c, d)

plt.plot(xs, ys)

for j in range(len(lams)):
  lam = lams[j]
  xmin = results[j]
  vxmin = V(xmin, a, b, c, d)
  plt.plot(xmin, vxmin, marker='.', linestyle="None", label=str(lam), color=colors[lam], ms=10)
plt.legend()
plt.show()

4 - Logistic Regression

Consider two 1D normal distributions with $\sigma^2=1$, located at $\mu_1=0.0$ and $\mu_2=2.0$. Sample N values from each of these distributions and assign the class labels "0" and "1" to the values ("0" for the values coming from the normal distribution centred at 0). Let this be your labeled data. Learn a logistic regression model on these data. Choose N=5 and N=100.

At which location is the predicted probability of your class label being "0" (and "1") exactly 50%?

Hints:

  • data1 = numpy.random.normal(mu1, sigma1, sizeN) (note that numpy.random.normal expects the standard deviation, not the variance)
  • You can see from the question how to choose your linear model (3.1) in the lecture notes: there is a constant term $\theta_0$, i.e. (3.1) becomes $\langle\theta, \hat x\rangle = (\theta_1, \theta_0) \cdot (x,1)^T = \theta_1 \cdot x + \theta_0$.
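
The following is a minimal sketch of one possible workflow. It uses scikit-learn's LogisticRegression (which applies a small L2 regularization by default) instead of your own gradient descent on (3.11); all variable names and the fixed random seed are my own choices, not prescribed by the exercise.

import numpy as np
from sklearn.linear_model import LogisticRegression

N = 100                          # also try N = 5
mu1, mu2, sigma = 0.0, 2.0, 1.0

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu1, sigma, N),    # class "0"
                    rng.normal(mu2, sigma, N)])   # class "1"
y = np.concatenate([np.zeros(N), np.ones(N)])

# scikit-learn adds the constant term theta_0 itself (fit_intercept=True).
clf = LogisticRegression().fit(x.reshape(-1, 1), y)

theta1, theta0 = clf.coef_[0, 0], clf.intercept_[0]
# The 50% decision point is where theta_1 * x + theta_0 = 0.
print("decision boundary at x =", -theta0 / theta1)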

5 - Logistic Regression scikit-learn example

Run and understand the example "MNIST classification using multinomial logistic regression" from scikit-learn.
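
If you would like a smaller, self-contained warm-up before running the full example (which downloads the complete MNIST data set), the sketch below applies logistic regression to scikit-learn's built-in 8x8 digits data set instead; the choice of dataset, scaling, and parameters here is mine and does not reproduce the official example.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 8x8 digit images as a lightweight stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)

# With the default lbfgs solver, recent scikit-learn versions fit a
# multinomial (softmax) model for multiclass targets.
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

print("test accuracy:", clf.score(scaler.transform(X_test), y_test))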